
Refactor and add incremental workflow functionality. #78

Open · wants to merge 25 commits into main

Conversation

@haileyplusplus (Collaborator) commented Apr 1, 2024

Description

Add the ability to update data incrementally, refactoring the processing workflow to make this feasible.

This refactors the existing logic into multiple new classes and splits realtime and schedule processing
as much as possible. A few other changes are included:

  • Writes the updated GeoJSON file (data.json) to data_output/scratch/frontend_data_to_wk.json.
  • Loads CTA schedules from the public S3 bucket as well, since transitfeeds.com is no longer being updated.
  • Skips scraping transitfeeds.com entirely when its data isn't needed.
  • Caches data downloaded from S3 and other sources to save bandwidth (see the sketch after this list).
  • Adds command-line arguments to update_data.py.
  • Makes minor changes to the progress bar display for better readability.
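
For illustration, the caching behavior described above could look roughly like the hypothetical CacheManager below (a cache_manager object does appear in the traceback later in this thread, but this layout is a sketch, not the PR's actual API):

import hashlib
from pathlib import Path

import requests


class CacheManager:
    """Illustrative download cache: fetch each URL once, reuse the local copy."""

    def __init__(self, cache_dir: Path = Path("data_output/cache")):
        self.cache_dir = cache_dir
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def retrieve(self, url: str) -> Path:
        # Key the cache entry on a hash of the URL so any source (S3 or
        # otherwise) can share one cache directory.
        local_path = self.cache_dir / hashlib.sha256(url.encode()).hexdigest()
        if not local_path.exists():
            resp = requests.get(url, timeout=60)
            resp.raise_for_status()
            local_path.write_bytes(resp.content)
        return local_path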

Resolves # [issue]

Type of change

  • Bug fix
  • New functionality
  • Documentation

How has this been tested?

Manually

@haileyplusplus (Collaborator, Author)

Some documentation polish is still needed.

@lauriemerrell (Member)

Thank you so much for taking this on, Hailey! Excited to talk about it in more detail tomorrow -- I skimmed today but definitely think an overview would be helpful since there's so much here.

Just wanted to respond to one note in the description for context:

> In addition, this demonstrates but does not fix an existing bug in the processing logic that causes some days on schedule version boundaries to be dropped from output. See the new utils/show_missing_days.py.

This was actually intended behavior because we decided we weren't sure which schedule should be used on boundary dates. We can just pick one if we want, but this was intentional.
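
For illustration only, "just pick one" could mean resolving boundary dates deterministically to the newer feed, along these lines (the feed_start_date/feed_end_date field names here are hypothetical, not the project's actual schema):

from datetime import date


def pick_feed(feeds: list[dict], day: date) -> dict:
    # Feeds whose validity window covers this day; on a schedule version
    # boundary, two feeds can match.
    candidates = [
        f for f in feeds
        if f["feed_start_date"] <= day <= f["feed_end_date"]
    ]
    # Deterministically prefer the newer schedule version.
    return max(candidates, key=lambda f: f["feed_start_date"])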

@haileyplusplus (Collaborator, Author)

> Thank you so much for taking this on, Hailey! Excited to talk about it in more detail tomorrow -- I skimmed today but definitely think an overview would be helpful since there's so much here.
>
> Just wanted to respond to one note in the description for context:
>
> > In addition, this demonstrates but does not fix an existing bug in the processing logic that causes some days on schedule version boundaries to be dropped from output. See the new utils/show_missing_days.py.
>
> This was actually intended behavior because we decided we weren't sure which schedule should be used on boundary dates. We can just pick one if we want, but this was intentional.

Thanks, I hadn't realized that. Good to know.

@lauriemerrell (Member)

Reviewer TODO to myself: check how the date boundary looks with the change in logic

@haileyplusplus (Collaborator, Author)

The download cache part of this is now in PR #80

@haileyplusplus (Collaborator, Author)

> > Thank you so much for taking this on, Hailey! Excited to talk about it in more detail tomorrow -- I skimmed today but definitely think an overview would be helpful since there's so much here.
> >
> > Just wanted to respond to one note in the description for context:
> >
> > > In addition, this demonstrates but does not fix an existing bug in the processing logic that causes some days on schedule version boundaries to be dropped from output. See the new utils/show_missing_days.py.
> >
> > This was actually intended behavior because we decided we weren't sure which schedule should be used on boundary dates. We can just pick one if we want, but this was intentional.
>
> Thanks, I hadn't realized that. Good to know.

I decided to keep the existing boundary behavior here. I updated the PR so that the new code that pulls scraped schedules from S3 behaves consistently.

@haileyplusplus linked an issue on Apr 25, 2024 that may be closed by this pull request: Automate updates to JSON files
@haileyplusplus marked this pull request as ready for review April 25, 2024 23:09
@haileyplusplus (Collaborator, Author) commented Apr 25, 2024

Ready for review now! Comments welcome.

This is now basically a superset of PR #80, so if desired you could just review this.

@haileyplusplus (Collaborator, Author)

For easier reviewing, try
git diff origin/main --color-moved-ws=allow-indentation-change --color-moved=blocks
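
(--color-moved=blocks highlights runs of lines that were moved rather than added and removed, and --color-moved-ws=allow-indentation-change still counts lines as moved when only their indentation changed, which suits a refactor that relocates existing code into new classes.)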

@lauriemerrell (Member) left a comment


Thank you again for tackling this! A few questions... I was mostly testing by trying to run things locally. Is there an order that I need to follow?

Specifically, when I try to run update_data, I get the following error:

python3 -m update_data                     
INFO:root: Searching page 1
INFO:root: Searching page 2
INFO:root: Searching page 3
INFO:root: Searching page 4
INFO:root: Found schedule for May 2022
INFO:root: Adding schedule for May 7, 2022
INFO:root:Processing 49 schedules.
Traceback (most recent call last):
  File "/Users/laurie/opt/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/laurie/opt/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/laurie/git/hailey-fork-ghost-buses/update_data.py", line 394, in <module>
    main()
  File "/Users/laurie/git/hailey-fork-ghost-buses/update_data.py", line 367, in main
    combined_long_df, summary_df = csrt.main(cache_manager, freq=freq, start_date=start_date, end_date=None,
  File "/Users/laurie/git/hailey-fork-ghost-buses/data_analysis/compare_scheduled_and_rt.py", line 296, in main
    return summarizer.main(existing)
  File "/Users/laurie/git/hailey-fork-ghost-buses/data_analysis/compare_scheduled_and_rt.py", line 282, in main
    this_iter = combiner.combine()
  File "/Users/laurie/git/hailey-fork-ghost-buses/data_analysis/compare_scheduled_and_rt.py", line 142, in combine
    schedule = self.schedule_summarizer.get_route_daily_summary()
  File "/Users/laurie/git/hailey-fork-ghost-buses/data_analysis/static_gtfs_analysis.py", line 109, in get_route_daily_summary
    trip_summary = self.make_trip_summary()
  File "/Users/laurie/git/hailey-fork-ghost-buses/data_analysis/static_gtfs_analysis.py", line 127, in make_trip_summary
    data = self.download_and_extract()
  File "/Users/laurie/git/hailey-fork-ghost-buses/data_analysis/static_gtfs_analysis.py", line 241, in download_and_extract
    cta_gtfs = zipfile.ZipFile(self.gtfs_fetcher.retrieve_file(self.schedule_feed_info))
  File "/Users/laurie/git/hailey-fork-ghost-buses/data_analysis/gtfs_fetcher.py", line 103, in retrieve_file
    local_filename, _, s3_filename, _ = self.versions[version_id]
KeyError: '20220507'

I can't tell whether I was supposed to run something first to initialize whatever is causing the KeyError?

Resolved review threads on: data_analysis/gtfs_fetcher.py (×2), data_analysis/schedule_manager.py (×2), data_analysis/compare_scheduled_and_rt.py, utils/s3_csv_reader.py, update_data.py
@haileyplusplus (Collaborator, Author)

> Thank you again for tackling this! A few questions... I was mostly testing by trying to run things locally. Is there an order that I need to follow?
>
> Specifically, when I try to run update_data, I get the following error:
>
> ...
>
> I can't tell whether I was supposed to run something first to initialize whatever is causing the KeyError?

update_data without arguments is broken for me too. I will work on fixing it. In the meantime, you can test the incremental workflow by pointing to the existing frontend status file with something like this:

python3 -m update_data --update ../ghost-buses-frontend/src/Routes/schedule_vs_realtime_all_day_types_routes.json
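
(As a sketch of the flag's shape only, not the PR's actual parser, --update could be wired up with argparse roughly like this:)

import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Update ghost-buses data.")
    # Hypothetical wiring: path to an existing frontend status file to
    # update incrementally instead of rebuilding from scratch.
    parser.add_argument(
        "--update",
        metavar="EXISTING_JSON",
        help="Path to an existing frontend status file to update incrementally.",
    )
    return parser.parse_args()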

@haileyplusplus (Collaborator, Author)

> > Thank you again for tackling this! A few questions... I was mostly testing by trying to run things locally. Is there an order that I need to follow?
> >
> > Specifically, when I try to run update_data, I get the following error:
> >
> > ...
> >
> > I can't tell whether I was supposed to run something first to initialize whatever is causing the KeyError?
>
> update_data without arguments is broken for me too. I will work on fixing it. In the meantime, you can test the incremental workflow by pointing to the existing frontend status file with something like this:
>
> python3 -m update_data --update ../ghost-buses-frontend/src/Routes/schedule_vs_realtime_all_day_types_routes.json

It was a simple fix for the non-arguments version, which I just committed. It should work now.

@haileyplusplus (Collaborator, Author)

Ok, I've updated the readme and addressed the open comments. I think this is ready for another round of testing and review. Please let me know what you think about deferring the transitfeeds changes.
